Week 6 - Hands on Exercise in Map Making and File Reading

Before we continue

This exercise Quarto file reads many csv files into R, so it is best to keep your files organized.

Please see slides 9-11 of https://jmtfeliciano.github.io/DATA312Spring2025/Data312Spring2025_Lecture4#9 if you need a refresher on file reading.

Some of the read_csv() calls use the local path method taught in Lecture 4, so you will have to hard-wire the full path if you don’t follow best practices.

Let us revisit left_join() and get you plenty of practice

I previously shared the image below in slide 33 of Lecture 5 last week to visually show you how left_join() works. We use left_join(data_frame1, data_frame2) to merge data from one data frame (table) into another. We typically want to save the joined data into a data frame in our R environment.

So what you’ll typically see is: new_data_frame <- left_join(data_frame1, data_frame2)

There are many types of joins, but you may think of new_data_frame <- left_join(data_frame1, data_frame2) as supplementing the underlying data of data_frame1 with data from data_frame2, then saving the joined data into the R environment as new_data_frame. To learn more about joins (e.g., other joins such as inner_join), you may see a detailed lecture on joins I gave last semester for DATA 412/612 here: https://jmtfeliciano.github.io/DATA412Fall2024/Data412612Lecture7
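To make the contrast concrete, here is a minimal sketch (with hypothetical toy data frames, not from this exercise) showing how left_join() and inner_join() differ:

```r
library(dplyr)

# Hypothetical toy frames to contrast two join types
x <- tibble(id = c(1, 2, 3), a = c("p", "q", "r"))
y <- tibble(id = c(1, 3),    b = c("u", "v"))

# left_join() keeps every row of x; id 2 has no match, so b becomes NA
left_result  <- left_join(x, y, by = join_by(id))

# inner_join() keeps only rows whose key appears in both data frames
inner_result <- inner_join(x, y, by = join_by(id))
```

Notice the row counts: the left join returns all three rows of x (with an NA where y had no match), while the inner join returns only the two matched rows.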

Important note: new_data_frame <- left_join(data_frame1, data_frame2) works only if the two data frames share a common variable. Moreover, the common variable must also be of the same type (e.g., both chr or both numeric/double/integer).
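A minimal sketch of checking both conditions before joining, using hypothetical toy frames (the names band and instrument are illustrative, not from this exercise):

```r
library(dplyr)

# Hypothetical toy frames sharing the key "name"
band       <- tibble(name = c("Mick", "John"), band  = c("Stones", "Beatles"))
instrument <- tibble(name = c("John", "Paul"), plays = c("guitar", "bass"))

# Confirm the key exists in both frames and has the same type before joining
stopifnot("name" %in% names(band), "name" %in% names(instrument))
stopifnot(identical(class(band$name), class(instrument$name)))

# Spelling out `by` also silences the "Joining with ..." message
joined <- left_join(band, instrument, by = join_by(name))
```

Unmatched rows on the left side survive the join with NA filled in, which is exactly the behavior you will see later with the employees and salaries data.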

Preliminaries Part 1

Run the script below to load the tidyverse package.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Practice 1: left_join() - 3 minutes

Important note about the code below: You don’t have to know tribble(). Just know that I am creating two data frames (one named comics_character, the other comics_publisher) in the R environment below.

Let us run this to create comics_character:

comics_character <- tibble::tribble(
       ~name, ~alignment,  ~gender, 
   "Magneto",      "bad",   "male",  
     "Storm",     "good", "female", 
  "Mystique",      "bad", "female",  
    "Batman",     "good",   "male",   
     "Joker",      "bad",   "male",   
  "Catwoman",      "bad", "female",   
   "Hellboy",     "good",   "male", 
  )

head(comics_character, 10)
# A tibble: 7 × 3
  name     alignment gender
  <chr>    <chr>     <chr> 
1 Magneto  bad       male  
2 Storm    good      female
3 Mystique bad       female
4 Batman   good      male  
5 Joker    bad       male  
6 Catwoman bad       female
7 Hellboy  good      male  

Let us run this to create comics_publisher:

comics_publisher <- tibble::tribble(
       ~name,         ~publisher,
   "Magneto",          "Marvel",
     "Storm",          "Marvel",
  "Mystique",          "Marvel",
    "Batman",          "DC",
     "Joker",          "DC",
  "Catwoman",          "DC",
   "Hellboy",          "Dark Horse Comics"
  )

head(comics_publisher, 10)
# A tibble: 7 × 2
  name     publisher        
  <chr>    <chr>            
1 Magneto  Marvel           
2 Storm    Marvel           
3 Mystique Marvel           
4 Batman   DC               
5 Joker    DC               
6 Catwoman DC               
7 Hellboy  Dark Horse Comics

Using the R chunk below, create comics_character_v2 after pulling the publisher data from comics_publisher into the underlying data for comics_character. Afterward, use head() to visually confirm that comics_character_v2 is created as expected. Important note: Before typing your code below, visually confirm from the printed data above that both data frames have (a) a shared variable with the same name and (b) that variable (also called the join key) has the same type in both data frames.

# Place your code below
comics_character_v2 <- left_join(comics_character, comics_publisher)
Joining with `by = join_by(name)`
head(comics_character_v2, 10)
# A tibble: 7 × 4
  name     alignment gender publisher        
  <chr>    <chr>     <chr>  <chr>            
1 Magneto  bad       male   Marvel           
2 Storm    good      female Marvel           
3 Mystique bad       female Marvel           
4 Batman   good      male   DC               
5 Joker    bad       male   DC               
6 Catwoman bad       female DC               
7 Hellboy  good      male   Dark Horse Comics

Practice 2: left_join() - 3 minutes

Use the space below to load employees.csv and salaries.csv (both are under Week 6 on Canvas) using read_csv() into your R environment. Name these data frames however you want as long as the names follow best practice. Make sure to use head() on both data frames so you can study the underlying data.

# Place your code below
employees <- read_csv("data/employees.csv")
Rows: 10 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Name
dbl (2): EmployeeID, Age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(employees)
# A tibble: 6 × 3
  EmployeeID Name      Age
       <dbl> <chr>   <dbl>
1        101 Alice      25
2        102 Bob        30
3        103 Charlie    35
4        104 David      40
5        105 Emma       28
6        106 Frank      33
salaries <- read_csv("data/salaries.csv")
Rows: 10 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): City
dbl (2): EmployeeID, Salary

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(salaries)
# A tibble: 6 × 3
  EmployeeID City         Salary
       <dbl> <chr>         <dbl>
1        101 New York      60000
2        102 Los Angeles   65000
3        104 Chicago       70000
4        106 Houston       62000
5        107 Phoenix       67000
6        108 Philadelphia  68000

Next, use left_join() to combine the salary data into the employees data. Save the resulting data frame into your R environment; you may name it however you want as long as the name follows best practice. Important note (again): Before typing your code below, visually confirm from the printed data above that both data frames have (a) a shared variable with the same name and (b) that shared variable (also called the join key) has the same type in both data frames.

# Now that we have confirmed join key is EmployeeID and they're the same type (dbl)

combined_data <- left_join(employees, salaries)
Joining with `by = join_by(EmployeeID)`
head(combined_data)
# A tibble: 6 × 5
  EmployeeID Name      Age City        Salary
       <dbl> <chr>   <dbl> <chr>        <dbl>
1        101 Alice      25 New York     60000
2        102 Bob        30 Los Angeles  65000
3        103 Charlie    35 <NA>            NA
4        104 David      40 Chicago      70000
5        105 Emma       28 <NA>            NA
6        106 Frank      33 Houston      62000

Join keys with different names

Suppose we want to left_join() two data frames called data_frame1 and data_frame2, but the join key (let’s say social security number) is spelled differently in each. How do you handle this? Before answering, let us look at the data for employee_ssn and salaries_ssn below:

employee_ssn <- read_csv("https://raw.githubusercontent.com/jmtfeliciano/teachingdata/refs/heads/main/employees_ssn.csv")
Rows: 10 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): social_security_number, Name
dbl (1): Age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
salaries_ssn <- read_csv("https://raw.githubusercontent.com/jmtfeliciano/teachingdata/refs/heads/main/salaries_ssn.csv")
Rows: 10 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): ssn, City
dbl (1): Salary

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(employee_ssn)
# A tibble: 6 × 3
  social_security_number Name      Age
  <chr>                  <chr>   <dbl>
1 123-45-6789            Alice      25
2 234-56-7890            Bob        30
3 345-67-8901            Charlie    35
4 456-78-9012            David      40
5 567-89-0123            Emma       28
6 678-90-1234            Frank      33
head(salaries_ssn)
# A tibble: 6 × 3
  ssn         City         Salary
  <chr>       <chr>         <dbl>
1 123-45-6789 New York      60000
2 234-56-7890 Los Angeles   65000
3 456-78-9012 Chicago       70000
4 678-90-1234 Houston       62000
5 789-01-2345 Phoenix       67000
6 890-12-3456 Philadelphia  68000

What error do you get when you have no common variable? Run the script below to see the error message:

left_join(employee_ssn, salaries_ssn)
Error in `left_join()`:
! `by` must be supplied when `x` and `y` have no common variables.
ℹ Use `cross_join()` to perform a cross-join.

An easy remedy: pipe one of the read_csv() calls into rename() so both data frames use the same name for the join key.
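There is also a way to join without renaming at all: tell join_by() how the two spellings correspond. A minimal sketch with hypothetical stand-in values for the two data frames:

```r
library(dplyr)

# Hypothetical stand-ins for the two data frames (values are illustrative)
employee_ssn <- tibble(social_security_number = c("123-45-6789", "234-56-7890"),
                       Name = c("Alice", "Bob"))
salaries_ssn <- tibble(ssn    = c("123-45-6789", "234-56-7890"),
                       Salary = c(60000, 65000))

# Alternative to rename(): map x's key to y's key inside join_by().
# The joined result keeps x's spelling (social_security_number).
joined <- left_join(employee_ssn, salaries_ssn,
                    by = join_by(social_security_number == ssn))
```

Both approaches give the same result; rename() has the advantage of fixing the name once and for all in your environment.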

Practice 3 (3 mins)

Modify the code below by incorporating or piping one of the read_csv() statements into rename() to enable the left_join() below:

# BEFORE WE MADE CHANGES
employee_ssn <- read_csv("https://raw.githubusercontent.com/jmtfeliciano/teachingdata/refs/heads/main/employees_ssn.csv") |>
  rename(ssn = social_security_number)

salaries_ssn <- read_csv("https://raw.githubusercontent.com/jmtfeliciano/teachingdata/refs/heads/main/salaries_ssn.csv")

head(employee_ssn)
head(salaries_ssn)

employee_salary_data <- left_join(employee_ssn, salaries_ssn)

head(employee_salary_data)
# AFTER WE MADE THE CHANGE

employee_ssn <- read_csv("https://raw.githubusercontent.com/jmtfeliciano/teachingdata/refs/heads/main/employees_ssn.csv") |>
  rename(ssn = social_security_number)
Rows: 10 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): social_security_number, Name
dbl (1): Age

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
salaries_ssn <- read_csv("https://raw.githubusercontent.com/jmtfeliciano/teachingdata/refs/heads/main/salaries_ssn.csv")
Rows: 10 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): ssn, City
dbl (1): Salary

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(employee_ssn)
# A tibble: 6 × 3
  ssn         Name      Age
  <chr>       <chr>   <dbl>
1 123-45-6789 Alice      25
2 234-56-7890 Bob        30
3 345-67-8901 Charlie    35
4 456-78-9012 David      40
5 567-89-0123 Emma       28
6 678-90-1234 Frank      33
head(salaries_ssn)
# A tibble: 6 × 3
  ssn         City         Salary
  <chr>       <chr>         <dbl>
1 123-45-6789 New York      60000
2 234-56-7890 Los Angeles   65000
3 456-78-9012 Chicago       70000
4 678-90-1234 Houston       62000
5 789-01-2345 Phoenix       67000
6 890-12-3456 Philadelphia  68000
employee_salary_data <- left_join(employee_ssn, salaries_ssn)
Joining with `by = join_by(ssn)`
head(employee_salary_data)
# A tibble: 6 × 5
  ssn         Name      Age City        Salary
  <chr>       <chr>   <dbl> <chr>        <dbl>
1 123-45-6789 Alice      25 New York     60000
2 234-56-7890 Bob        30 Los Angeles  65000
3 345-67-8901 Charlie    35 <NA>            NA
4 456-78-9012 David      40 Chicago      70000
5 567-89-0123 Emma       28 <NA>            NA
6 678-90-1234 Frank      33 Houston      62000

Join keys with different types

Before proceeding, run the following script, which will import two new data frames named employee_ssn_v2 and salaries_ssn_v2 into your R environment. Pay close attention to the social security number column and its data type in each data frame.

Note: You don’t really have to know what load() does; just know that I may ask you to run similar code in a future class, homework, or exam, since it allows you to import multiple data frames at the same time.

load(url("https://github.com/jmtfeliciano/teachingdata/raw/refs/heads/main/salaries_data.RData"))

head(employee_ssn_v2)
# A tibble: 6 × 3
  social_security_number Name      Age
                   <dbl> <chr>   <dbl>
1              123456789 Alice      25
2              234567890 Bob        30
3              345678901 Charlie    35
4              456789012 David      40
5              567890123 Emma       28
6              678901234 Frank      33
head(salaries_ssn_v2)
# A tibble: 6 × 3
  social_security_number City         Salary
  <chr>                  <chr>         <dbl>
1 123456789              New York      60000
2 234567890              Los Angeles   65000
3 456789012              Chicago       70000
4 678901234              Houston       62000
5 789012345              Phoenix       67000
6 890123456              Philadelphia  68000

What error do you get when the join key has different types in the two data frames? Run the script below to see the error message:

left_join(employee_ssn_v2, salaries_ssn_v2)
Joining with `by = join_by(social_security_number)`
Error in `left_join()`:
! Can't join `x$social_security_number` with `y$social_security_number`
  due to incompatible types.
ℹ `x$social_security_number` is a <double>.
ℹ `y$social_security_number` is a <character>.

Again the issue: social_security_number in employee_ssn_v2 is double (which is numeric) while social_security_number in salaries_ssn_v2 is character or string.

Two ways to fix this:

(a) Create another data frame to use for the merge after coercing social_security_number in employee_ssn_v2 into character type by using:

|> mutate(social_security_number = as.character(social_security_number))

or

(b) Create another data frame to use for the merge after coercing social_security_number in salaries_ssn_v2 into numeric type by using:

|> mutate(social_security_number = as.numeric(social_security_number))
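A minimal sketch of fix (a), using hypothetical toy frames that mirror the same type mismatch (numeric key in one frame, character key in the other):

```r
library(dplyr)

# Hypothetical toy frames with the same mismatch as employee_ssn_v2 / salaries_ssn_v2:
# the key is numeric in one frame and character in the other
employees_num <- tibble(social_security_number = c(123456789, 234567890),
                        Name = c("Alice", "Bob"))
salaries_chr  <- tibble(social_security_number = c("123456789", "234567890"),
                        Salary = c(60000, 65000))

# Fix (a): coerce the numeric key to character, then join
employees_fixed <- employees_num |>
  mutate(social_security_number = as.character(social_security_number))

joined <- left_join(employees_fixed, salaries_chr, by = join_by(social_security_number))
```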

In class practice together

Let us implement one of the fixes together as a class. Remember, employee_ssn_v2 and salaries_ssn_v2 already exist in your environment, and the goal is to make left_join() work.

# In class practice together
salaries_corrected <- salaries_ssn_v2 |>
  mutate(social_security_number = as.numeric(social_security_number))

joined_data <- left_join(employee_ssn_v2, salaries_corrected)
Joining with `by = join_by(social_security_number)`
head(joined_data)
# A tibble: 6 × 5
  social_security_number Name      Age City        Salary
                   <dbl> <chr>   <dbl> <chr>        <dbl>
1              123456789 Alice      25 New York     60000
2              234567890 Bob        30 Los Angeles  65000
3              345678901 Charlie    35 <NA>            NA
4              456789012 David      40 Chicago      70000
5              567890123 Emma       28 <NA>            NA
6              678901234 Frank      33 Houston      62000

Map-making recipe from Lecture 5

  1. We need a shapefile, which is a digital format for storing geographic locations and associated attribute information (e.g., points, lines, or polygons). Think of the boundaries, shapes, and geometric information that delineate locations. We will use the tigris package to get shapefiles directly from the US Census Bureau, so we don’t have to download and manually load actual shapefiles (.shp) into R. Shapefiles loaded into R from tigris are both sf objects (simple features, i.e., geospatial data in R) and data frames.

  2. We need the data we want to map. For example, use read_csv() to read the data you want to plot.

  3. Merge data (e.g., SVI data) to the shapefile (which is also a data frame) via left_join().

    Additional note about the shapefile: The join key for a shapefile from the tigris package will always be GEOID, of character (or string) type. If there is a mismatch on the join key (i.e., a differently spelled join key or a join key of a different type), pipe the shapefile or read_csv() into (a) rename() or (b) mutate() with either as.character() or as.numeric(). If you are loading csv files into R, GEOID or FIPS will almost always be read as a numeric type, so either (i) convert GEOID from the shapefile into numeric type or (ii) convert GEOID from the file you read into character type.

  4. Leverage ggplot2 package to render the map by using geom_sf().
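The GEOID note in step 3 can be sketched without any downloads, using hypothetical mini versions of a shapefile and a CSV (the values below are illustrative, not real census data):

```r
library(dplyr)

# Hypothetical mini versions of a tigris shapefile (GEOID is character)
# and a CSV whose key was read in as numeric and named FIPS
shp <- tibble(GEOID = c("11001000101", "11001000102"), NAME = c("1.01", "1.02"))
csv <- tibble(FIPS  = c(11001000101, 11001000102),     value = c(0.4, 0.7))

# Option (ii): rename FIPS to GEOID and convert it to character to match the shapefile
csv_fixed <- csv |>
  rename(GEOID = FIPS) |>
  mutate(GEOID = as.character(GEOID))

shp_with_data <- left_join(shp, csv_fixed, by = join_by(GEOID))
```

Option (i) would instead leave the CSV alone and pipe the shapefile into mutate(GEOID = as.numeric(GEOID)), as the worked examples below do.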

Preliminaries Part 2

Run the script below to load the tigris package:

library(tigris)
To enable caching of data, set `options(tigris_use_cache = TRUE)`
in your R script or .Rprofile.

Map 1: census tract data for a particular state

Our goal for our first map is to create a map for the 2019 social vulnerability for MO.

Step 1 of Map 1: use tracts() to get the tract-level shapefile for the state of MO for 2019.

dc_tract_shp <- tracts(year = 2019, state = "MO") 

# important note: shapefile R objects don't show you variable/column data type via head() so you must use glimpse() for this
glimpse(dc_tract_shp)
Rows: 1,393
Columns: 13
$ STATEFP  <chr> "29", "29", "29", "29", "29", "29", "29", "29", "29", "29", "…
$ COUNTYFP <chr> "055", "055", "055", "055", "015", "229", "229", "229", "229"…
$ TRACTCE  <chr> "450302", "450102", "450200", "450400", "460400", "490300", "…
$ GEOID    <chr> "29055450302", "29055450102", "29055450200", "29055450400", "…
$ NAME     <chr> "4503.02", "4501.02", "4502", "4504", "4604", "4903", "4904",…
$ NAMELSAD <chr> "Census Tract 4503.02", "Census Tract 4501.02", "Census Tract…
$ MTFCC    <chr> "G5020", "G5020", "G5020", "G5020", "G5020", "G5020", "G5020"…
$ FUNCSTAT <chr> "S", "S", "S", "S", "S", "S", "S", "S", "S", "S", "S", "S", "…
$ ALAND    <dbl> 59019556, 215515312, 785265618, 518540939, 216350354, 3354059…
$ AWATER   <dbl> 54839, 158937, 714683, 475755, 11553444, 257629, 136121, 1863…
$ INTPTLAT <chr> "+38.0699995", "+38.1505661", "+37.9120761", "+37.8958096", "…
$ INTPTLON <chr> "-091.3834407", "-091.1929142", "-091.2086380", "-091.3892205…
$ geometry <POLYGON [°]> POLYGON ((-91.42897 38.0501..., POLYGON ((-91.31192 3…

Step 2 of Map 1: read the data you want to map into R. We eventually want to plot the values of RPL_THEMES.

mo_svi_data <- read_csv("https://raw.githubusercontent.com/jmtfeliciano/teachingdata/refs/heads/main/MissouriSVI2019.csv")
Rows: 1388 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): STATE, ST_ABBR, COUNTY, LOCATION
dbl (8): ST, STCNTY, FIPS, RPL_THEME1, RPL_THEME2, RPL_THEME3, RPL_THEME4, R...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(mo_svi_data)
# A tibble: 6 × 12
     ST STATE    ST_ABBR STCNTY COUNTY       FIPS LOCATION RPL_THEME1 RPL_THEME2
  <dbl> <chr>    <chr>    <dbl> <chr>       <dbl> <chr>         <dbl>      <dbl>
1    29 MISSOURI MO       29189 St. Louis 2.92e10 Census …     0.0123     0.067 
2    29 MISSOURI MO       29095 Jackson   2.91e10 Census …     0.0007     0.130 
3    29 MISSOURI MO       29095 Jackson   2.91e10 Census …     0.018      0.0252
4    29 MISSOURI MO       29095 Jackson   2.91e10 Census …     0          0.189 
5    29 MISSOURI MO       29189 St. Louis 2.92e10 Census …     0.0079     0.219 
6    29 MISSOURI MO       29095 Jackson   2.91e10 Census …     0.0346     0.0159
# ℹ 3 more variables: RPL_THEME3 <dbl>, RPL_THEME4 <dbl>, RPL_THEMES <dbl>

Step 3 of Map 1: use left_join() to add the data you loaded into the shapefile. Modify dc_tract_shp and mo_svi_data to enable the join. Note: In this example, we piped mo_svi_data into rename(GEOID = FIPS) because the join key in the shapefile is GEOID, so we renamed FIPS to GEOID within mo_svi_data to enable the join.

# converts GEOID in the shapefile into numeric type instead of character
dc_tract_shp_corrected <- dc_tract_shp |>
  mutate(GEOID = as.numeric(GEOID))

# renames FIPS into GEOID to enable join
mo_svi_data_corrected <- mo_svi_data |>
  rename(GEOID = FIPS) 

# perform join
dc_tract_shp_with_data <- left_join(dc_tract_shp_corrected, mo_svi_data_corrected)
Joining with `by = join_by(GEOID)`
# visualize data
glimpse(dc_tract_shp_with_data)
Rows: 1,393
Columns: 24
$ STATEFP    <chr> "29", "29", "29", "29", "29", "29", "29", "29", "29", "29",…
$ COUNTYFP   <chr> "055", "055", "055", "055", "015", "229", "229", "229", "22…
$ TRACTCE    <chr> "450302", "450102", "450200", "450400", "460400", "490300",…
$ GEOID      <dbl> 29055450302, 29055450102, 29055450200, 29055450400, 2901546…
$ NAME       <chr> "4503.02", "4501.02", "4502", "4504", "4604", "4903", "4904…
$ NAMELSAD   <chr> "Census Tract 4503.02", "Census Tract 4501.02", "Census Tra…
$ MTFCC      <chr> "G5020", "G5020", "G5020", "G5020", "G5020", "G5020", "G502…
$ FUNCSTAT   <chr> "S", "S", "S", "S", "S", "S", "S", "S", "S", "S", "S", "S",…
$ ALAND      <dbl> 59019556, 215515312, 785265618, 518540939, 216350354, 33540…
$ AWATER     <dbl> 54839, 158937, 714683, 475755, 11553444, 257629, 136121, 18…
$ INTPTLAT   <chr> "+38.0699995", "+38.1505661", "+37.9120761", "+37.8958096",…
$ INTPTLON   <chr> "-091.3834407", "-091.1929142", "-091.2086380", "-091.38922…
$ ST         <dbl> 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29, 29,…
$ STATE      <chr> "MISSOURI", "MISSOURI", "MISSOURI", "MISSOURI", "MISSOURI",…
$ ST_ABBR    <chr> "MO", "MO", "MO", "MO", "MO", "MO", "MO", "MO", "MO", "MO",…
$ STCNTY     <dbl> 29055, 29055, 29055, 29055, 29015, 29229, 29229, 29229, 292…
$ COUNTY     <chr> "Crawford", "Crawford", "Crawford", "Crawford", "Benton", "…
$ LOCATION   <chr> "Census Tract 4503.02, Crawford County, Missouri", "Census …
$ RPL_THEME1 <dbl> 0.9798, 0.7815, 0.9027, 0.7808, 0.9041, 0.8472, 0.8594, 0.7…
$ RPL_THEME2 <dbl> 0.9114, 0.9272, 0.3509, 0.9063, 0.5274, 0.9568, 0.9856, 0.6…
$ RPL_THEME3 <dbl> 0.6614, 0.3105, 0.0375, 0.0879, 0.0187, 0.0468, 0.4517, 0.0…
$ RPL_THEME4 <dbl> 0.8271, 0.7082, 0.5317, 0.6376, 0.2305, 0.8170, 0.8847, 0.3…
$ RPL_THEMES <dbl> 0.9690, 0.7952, 0.5537, 0.7123, 0.4852, 0.8147, 0.9466, 0.4…
$ geometry   <POLYGON [°]> POLYGON ((-91.42897 38.0501..., POLYGON ((-91.31192…

Step 4 of Map 1: use ggplot() to plot the map. Again, we want to plot the values of RPL_THEMES. It is best to add a theme_bw(), theme_minimal(), or theme_void() customization.

ggplot(dc_tract_shp_with_data, aes(fill = RPL_THEMES)) + 
  geom_sf() + 
  theme_void() + 
  scale_fill_gradient(low="white", high="blue4") +
  labs(fill='MO-Specific SVI')

A condensed version of the code above, where the corrections are done via piping (preferred):

library(tigris)

# Save shapefile and makes GEOID numeric instead of character type
dc_tract_shp <- tracts(year = 2019, state = "MO") |>
  mutate(GEOID = as.numeric(GEOID))

# Loads SVI data and rename FIPS into GEOID to enable join
mo_svi_data <- read_csv("https://raw.githubusercontent.com/jmtfeliciano/teachingdata/refs/heads/main/MissouriSVI2019.csv") |> 
  rename(GEOID = FIPS) 
Rows: 1388 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): STATE, ST_ABBR, COUNTY, LOCATION
dbl (8): ST, STCNTY, FIPS, RPL_THEME1, RPL_THEME2, RPL_THEME3, RPL_THEME4, R...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Perform actual join
dc_tract_shp_with_data <- left_join(dc_tract_shp, mo_svi_data)
Joining with `by = join_by(GEOID)`
ggplot(dc_tract_shp_with_data, aes(fill = RPL_THEMES)) + 
  geom_sf() + 
  theme_void() + 
  scale_fill_gradient(low="white", high="blue4") +
  labs(fill='MO-Specific SVI')

Practice 4 (15 mins)

For this class, your goal is to read dc_household_income_2023.csv and plot median_income, which is the median household income in DC for 2023. Use the R chunk below to create your map. You may add additional R chunks if desired, or simply use the one space below to do all four steps of the map-making recipe. Reminder: When you load data via read_csv(), always check whether the join key is named GEOID or something else. You only need to pipe read_csv() into rename() if the desired join key is not named GEOID. Also: please use another color gradient (don’t use white and blue4). Please see page 3 of https://www.nceas.ucsb.edu/sites/default/files/2020-04/colorPaletteCheatsheet.pdf for available colors.

# Place your code below

# Step 1:
dc_shapefile <- tracts(year = 2023, state = "DC") |>
  mutate(GEOID = as.numeric(GEOID))

glimpse(dc_shapefile)

# Step 2:
income_data <- read_csv("data/dc_household_income_2023.csv")
head(income_data)

# Step 3
dc_shapefile_with_data <- left_join(dc_shapefile, income_data)
glimpse(dc_shapefile_with_data)

# Step 4
ggplot(dc_shapefile_with_data, aes(fill = median_income)) +
  geom_sf() +
  theme_void() +
  scale_fill_gradient(low = "coral3", high = "slategray1") +
  labs(fill = "Household Median\nIncome For DC 2023")

# \n tells R to bring the rest of the legend to the next line
Rows: 206
Columns: 14
$ STATEFP  <chr> "11", "11", "11", "11", "11", "11", "11", "11", "11", "11", "…
$ COUNTYFP <chr> "001", "001", "001", "001", "001", "001", "001", "001", "001"…
$ TRACTCE  <chr> "005302", "004402", "010602", "000804", "003902", "000903", "…
$ GEOID    <dbl> 11001005302, 11001004402, 11001010602, 11001000804, 110010039…
$ GEOIDFQ  <chr> "1400000US11001005302", "1400000US11001004402", "1400000US110…
$ NAME     <chr> "53.02", "44.02", "106.02", "8.04", "39.02", "9.03", "10.04",…
$ NAMELSAD <chr> "Census Tract 53.02", "Census Tract 44.02", "Census Tract 106…
$ MTFCC    <chr> "G5020", "G5020", "G5020", "G5020", "G5020", "G5020", "G5020"…
$ FUNCSTAT <chr> "S", "S", "S", "S", "S", "S", "S", "S", "S", "S", "S", "S", "…
$ ALAND    <dbl> 117635, 274746, 542512, 2566768, 266513, 928101, 1471925, 170…
$ AWATER   <dbl> 0, 0, 0, 167978, 11205, 0, 0, 516665, 0, 0, 5261, 0, 0, 0, 0,…
$ INTPTLAT <chr> "+38.9124517", "+38.9155152", "+38.9033936", "+38.9221746", "…
$ INTPTLON <chr> "-077.0386443", "-077.0270354", "-076.9994636", "-077.0918347…
$ geometry <POLYGON [°]> POLYGON ((-77.04167 38.9115..., POLYGON ((-77.03195 3…
Rows: 201 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): variable
dbl (3): GEOID, median_income, moe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 6 × 4
        GEOID variable   median_income   moe
        <dbl> <chr>              <dbl> <dbl>
1 11001000101 B19013_001        135708 43153
2 11001000102 B19013_001        159583 70430
3 11001000202 B19013_001        152059 60482
4 11001000300 B19013_001        174470 21374
5 11001000400 B19013_001        188929 71412
6 11001000501 B19013_001        109116 18172
Joining with `by = join_by(GEOID)`
Rows: 206
Columns: 17
$ STATEFP       <chr> "11", "11", "11", "11", "11", "11", "11", "11", "11", "1…
$ COUNTYFP      <chr> "001", "001", "001", "001", "001", "001", "001", "001", …
$ TRACTCE       <chr> "005302", "004402", "010602", "000804", "003902", "00090…
$ GEOID         <dbl> 11001005302, 11001004402, 11001010602, 11001000804, 1100…
$ GEOIDFQ       <chr> "1400000US11001005302", "1400000US11001004402", "1400000…
$ NAME          <chr> "53.02", "44.02", "106.02", "8.04", "39.02", "9.03", "10…
$ NAMELSAD      <chr> "Census Tract 53.02", "Census Tract 44.02", "Census Trac…
$ MTFCC         <chr> "G5020", "G5020", "G5020", "G5020", "G5020", "G5020", "G…
$ FUNCSTAT      <chr> "S", "S", "S", "S", "S", "S", "S", "S", "S", "S", "S", "…
$ ALAND         <dbl> 117635, 274746, 542512, 2566768, 266513, 928101, 1471925…
$ AWATER        <dbl> 0, 0, 0, 167978, 11205, 0, 0, 516665, 0, 0, 5261, 0, 0, …
$ INTPTLAT      <chr> "+38.9124517", "+38.9155152", "+38.9033936", "+38.922174…
$ INTPTLON      <chr> "-077.0386443", "-077.0270354", "-076.9994636", "-077.09…
$ variable      <chr> "B19013_001", "B19013_001", "B19013_001", "B19013_001", …
$ median_income <dbl> 138241, 177902, 175687, 250001, 114700, 250001, 222692, …
$ moe           <dbl> 33146, 35204, 21551, NA, 15081, NA, 46518, 70430, 41567,…
$ geometry      <POLYGON [°]> POLYGON ((-77.04167 38.9115..., POLYGON ((-77.03…

Map 2: county data for a particular state

You need to follow the same recipe, but this time we use counties(year = [year], state = "[state abbreviation]") for the shapefile.

Our goal for this map is to create a county-level Virginia (VA) map of median household income using data from va_household_income_2020.csv

This time, we need to plot estimate, which represents the county-level median income for VA.

Trying new (optional) customizations this time: Let us (a) add labels = scales::comma inside scale_fill_gradient() and (b) add + theme(legend.position = c(0.2, 0.8)) at the bottom of the ggplot code, then see what they do! Note: theme(legend.position = c(0.2, 0.8)) shifts the legend from the lower left corner of the image 0.2 ‘units’ to the right (think of an x-axis in math) and 0.8 ‘units’ up. Again, these are optional.

# Step 1: Load shapefile 
# Note: We can just pipe any corrections into mutate() directly
va_shapefile <- counties(year = 2020, state = "VA") |>
  mutate(GEOID = as.numeric(GEOID))

# Step 2: Load VA county-level household income for 2020. 
va_income_data2020 <- read_csv("data/va_household_income_2020.csv")

head(va_income_data2020)

# Step 3: Perform the join
va_shapefile_with_data <- left_join(va_shapefile, va_income_data2020)

# Step 4: Create the map
ggplot(va_shapefile_with_data, aes(fill = estimate)) +
  geom_sf() +
  theme_void() + 
  scale_fill_gradient(low="white", high="blue4", labels = scales::comma) +
  labs(fill='VA Median Income') +
  theme(legend.position = c(0.2, 0.8))
Rows: 133 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): variable
dbl (3): GEOID, estimate, moe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 6 × 4
  GEOID variable   estimate   moe
  <dbl> <chr>         <dbl> <dbl>
1 51001 B19013_001    46178  2575
2 51003 B19013_001    84643  3217
3 51005 B19013_001    48513  4275
4 51007 B19013_001    63918  6276
5 51009 B19013_001    57368  4546
6 51011 B19013_001    55457  5246
Joining with `by = join_by(GEOID)`
Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2
3.5.0.
ℹ Please use the `legend.position.inside` argument of `theme()` instead.
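The deprecated call still renders, but the warning points at the newer form. A sketch of the replacement theme lines, assuming ggplot2 >= 3.5.0:

```r
library(ggplot2)

# Non-deprecated form in ggplot2 >= 3.5.0: mark the legend as "inside",
# then give its inside coordinates separately
th <- theme(legend.position = "inside",
            legend.position.inside = c(0.2, 0.8))
```

You would add `+ th` (or the theme() call itself) at the end of the ggplot code in place of the old theme(legend.position = c(0.2, 0.8)) line.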

Pre-practice

Last week, we did a demo of the Small Area Health Insurance Estimates (SAHIE) dashboard, where I quickly demonstrated downloading data from:

https://www.census.gov/data-tools/demo/sahie/#/

Let us quickly visit the website and download county-level data for Maryland (MD) for 2022.

Practice 5 (10 mins)

Data pre-processing before R: After downloading the file, open the csv file on your computer and delete the first few rows until the row of variable names is the first row. Note: The uninsured rate is the column called %.

R hates symbols in column names (they make the columns difficult to work with), so before saving the file, rename % to something meaningful (e.g., uninsured_rate). R also hates white space in column or variable names (again, this makes them difficult to work with), so delete white spaces from column names as well. For Mac users: saving your file via File > Save in Numbers (the default program for csv files on a Mac) does not produce a csv; it actually creates what is called a Numbers spreadsheet. You have to save the new csv file via File > Export To > CSV.
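If you prefer, this renaming can also be done in R instead of a spreadsheet program: wrap an awkward name in backticks inside rename(). A minimal sketch with a made-up data frame (the original column names are illustrative of what SAHIE exports):

```r
library(dplyr)

# Toy data frame with a symbol column name and a space in another name
sahie <- tibble(`%` = c(7.1, 6.1), `Margin of Error` = c(1.0, 0.9))

# Backticks let rename() refer to names containing symbols or white space
sahie <- sahie |>
  rename(uninsured_rate  = `%`,
         margin_of_error = `Margin of Error`)

names(sahie)  # "uninsured_rate" "margin_of_error"
```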

Use the R chunk below to do your work. Like always, you may add additional R chunk if desired. Also: please use another color gradient (don’t use white and blue4). Please see page 3 of https://www.nceas.ucsb.edu/sites/default/files/2020-04/colorPaletteCheatsheet.pdf for available colors.

# Place your code below
md_shp <- counties(year = 2022, state = "MD") |>
  mutate(GEOID = as.numeric(GEOID))

md_uninsured_data <- read_csv("data/SAHIE_02-18-2025.csv") |>
  rename(GEOID = ID)

md_shp_with_data <- left_join(md_shp, md_uninsured_data)
head(md_shp_with_data)

ggplot(md_shp_with_data, aes(fill = uninsured_rate)) +
  geom_sf() 
# add customizations yourself later
New names:
• `Number` -> `Number...4`
• `Margin of Error` -> `Margin of Error...5`
• `Number` -> `Number...8`
• `Margin of Error` -> `Margin of Error...9`
Rows: 25 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Name
dbl (6): Year, ID, Margin of Error...5, uninsured_rate, % Margin of Error, M...
num (2): Number...4, Number...8
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Joining with `by = join_by(GEOID)`
Simple feature collection with 6 features and 25 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -79.06756 ymin: 37.99421 xmax: -74.98628 ymax: 39.72304
Geodetic CRS:  NAD83
  STATEFP COUNTYFP COUNTYNS GEOID      NAME         NAMELSAD LSAD CLASSFP MTFCC
1      24      047 01668802 24047 Worcester Worcester County   06      H1 G4020
2      24      001 01713506 24001  Allegany  Allegany County   06      H1 G4020
3      24      510 01702381 24510 Baltimore   Baltimore city   25      C7 G4020
4      24      015 00596115 24015     Cecil     Cecil County   06      H1 G4020
5      24      005 01695314 24005 Baltimore Baltimore County   06      H1 G4020
6      24      013 01696228 24013   Carroll   Carroll County   06      H1 G4020
  CSAFP CBSAFP METDIVFP FUNCSTAT      ALAND    AWATER    INTPTLAT     INTPTLON
1  <NA>   <NA>     <NA>        A 1213156432 586531107 +38.2221332 -075.3099315
2  <NA>   <NA>     <NA>        A 1093489886  14710932 +39.6123134 -078.7031037
3  <NA>   <NA>     <NA>        F  209649330  28758743 +39.3000324 -076.6104761
4  <NA>   <NA>     <NA>        A  896912534 185281256 +39.5623537 -075.9415852
5  <NA>   <NA>     <NA>        A 1549740652 215957832 +39.4431666 -076.6165693
6  <NA>   <NA>     <NA>        A 1159355858  13112464 +39.5633280 -077.0153297
  Year                 Name Number...4 Margin of Error...5 uninsured_rate
1 2022 Worcester County, MD       2686                 373            7.1
2 2022  Allegany County, MD       2960                 424            6.1
3 2022   Baltimore city, MD      30030                2908            6.6
4 2022     Cecil County, MD       4676                 603            5.4
5 2022 Baltimore County, MD      40013                3482            6.0
6 2022   Carroll County, MD       6130                 745            4.3
  % Margin of Error Number...8 Margin of Error...9
1               1.0      37991                   0
2               0.9      48441                   0
3               0.6     456931                   0
4               0.7      86661                   0
5               0.5     668068                   0
6               0.5     141937                   0
                        geometry
1 MULTIPOLYGON (((-75.07849 3...
2 MULTIPOLYGON (((-78.42042 3...
3 MULTIPOLYGON (((-76.71151 3...
4 MULTIPOLYGON (((-75.77209 3...
5 MULTIPOLYGON (((-76.69766 3...
6 MULTIPOLYGON (((-77.31156 3...

Map 3: state data nationwide

Again, this follows the prescribed recipe but with added complications. This time, we need to use states(year = [year]).

Let us create the map using what we learned so far (but notice an issue with the plot):

# 'Bad code'
nationwide_shapefile <- states(year = 2021) 
    
nationwide_income <- read_csv("data/nationwide_household_income_2021.csv") |>
  rename(GEOID = fips_code)

nationwide_shp_with_data <- left_join(nationwide_shapefile, nationwide_income)

ggplot(nationwide_shp_with_data, aes(fill = estimate)) +
  geom_sf() +
  theme_void() + 
  scale_fill_gradient(low="white", high="blue", labels = scales::comma) +
  labs(fill='Household Income') 
Rows: 52 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): fips_code, variable
dbl (2): estimate, moe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Joining with `by = join_by(GEOID)`

When mapping the entire country, we typically rescale and place Alaska, Hawaii, and Puerto Rico under the contiguous United States. We can do this in R by piping states() into shift_geometry(). shift_geometry() only works for those three states/territories, not for other territories such as Guam, so we need to remove the rest using filter().

Note: I am introducing the ! notation inside filter(). The ! negates the condition, so filter() keeps the rows that do not match.
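A toy sketch of how ! flips a filter (the rows here are made up for illustration):

```r
library(dplyr)

territories <- c("Guam", "American Samoa")
df <- tibble(NAME = c("Virginia", "Guam", "Maryland"))

df |> filter(NAME %in% territories)   # keeps only Guam
df |> filter(!NAME %in% territories)  # keeps Virginia and Maryland
```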

# 'Good code'
nationwide_shapefile <- states(year = 2021) |>
  filter(!NAME %in% c("American Samoa", 
                      "Commonwealth of the Northern Mariana Islands", 
                      "Guam", 
                      "United States Virgin Islands")) |>
  shift_geometry()
    
nationwide_income <- read_csv("data/nationwide_household_income_2021.csv") |>
  rename(GEOID = fips_code)

nationwide_shp_with_data <- left_join(nationwide_shapefile, nationwide_income)


ggplot(nationwide_shp_with_data, aes(fill = estimate)) +
  geom_sf() +
  theme_void() + 
  scale_fill_gradient(low="white", high="blue", labels = scales::comma) +
  labs(fill='Median Household Income') 
Rows: 52 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): fips_code, variable
dbl (2): estimate, moe

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Joining with `by = join_by(GEOID)`

Practice 6 (10 mins)

Plot data from breast_cancer_state_mortality_2023.csv to show the mortality rate for states across the US in 2023. The variable you want is deaths_per_100000.

## WILL BE ASSIGNED FOR HOMEWORK

Plotting categorical data instead of numerical data

Suppose we want a map indicating which states have expanded Medicaid access for their citizens as of December 2023. Let us view the data below before plotting:

nationwide2023 <- states(year = 2023) |>
  mutate(GEOID = as.numeric(GEOID)) |>
  filter(!NAME %in% c("American Samoa", 
                      "Commonwealth of the Northern Mariana Islands", 
                      "Guam", 
                      "United States Virgin Islands")) |>
  shift_geometry()


medicaid_expansion <- read_csv("data/medicaid_expansion.csv")
head(medicaid_expansion)

nationwide2023_with_data <- left_join(nationwide2023, medicaid_expansion)
head(nationwide2023_with_data)
Rows: 51 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): StateName, medicaid_expansion
dbl (1): GEOID

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 6 × 3
  GEOID StateName  medicaid_expansion
  <dbl> <chr>      <chr>             
1     1 Alabama    Not Adopted       
2     2 Alaska     Adopted           
3     4 Arizona    Adopted           
4     5 Arkansas   Adopted           
5     6 California Adopted           
6     8 Colorado   Adopted           
Joining with `by = join_by(GEOID)`
Simple feature collection with 6 features and 17 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -91843.51 ymin: -1354601 xmax: 2042471 ymax: 1323908
Projected CRS: USA_Contiguous_Albers_Equal_Area_Conic
  REGION DIVISION STATEFP  STATENS GEOID     GEOIDFQ STUSPS          NAME LSAD
1      3        5      54 01779805    54 0400000US54     WV West Virginia   00
2      3        5      12 00294478    12 0400000US12     FL       Florida   00
3      2        3      17 01779784    17 0400000US17     IL      Illinois   00
4      2        4      27 00662849    27 0400000US27     MN     Minnesota   00
5      3        5      24 01714934    24 0400000US24     MD      Maryland   00
6      1        1      44 01219835    44 0400000US44     RI  Rhode Island   00
  MTFCC FUNCSTAT        ALAND      AWATER    INTPTLAT     INTPTLON
1 G4000        A  62266499712   489003081 +38.6472854 -080.6183274
2 G4000        A 138963763779 45970528648 +28.3989775 -082.5143005
3 G4000        A 143778366814  6216688589 +40.1028754 -089.1526108
4 G4000        A 206244555303 18937471947 +46.3159573 -094.1996043
5 G4000        A  25151736098  6979330958 +38.9466584 -076.6744939
6 G4000        A   2677763372  1323686976 +41.5964850 -071.5264901
      StateName medicaid_expansion                       geometry
1 West Virginia            Adopted MULTIPOLYGON (((1548777 354...
2       Florida        Not Adopted MULTIPOLYGON (((1318717 -13...
3      Illinois            Adopted MULTIPOLYGON (((701650.5 11...
4     Minnesota            Adopted MULTIPOLYGON (((50386.4 128...
5      Maryland            Adopted MULTIPOLYGON (((1718035 379...
6  Rhode Island            Adopted MULTIPOLYGON (((2002871 667...

Important note for plotting categorical data: Do not use scale_fill_gradient() as we have been doing since that is for plotting numerical values. Let us plot our image before customizing our map.

ggplot(nationwide2023_with_data, aes(fill = medicaid_expansion)) +
  geom_sf() +
  theme_void() + 
  labs(fill='Medicaid Expansion') 

How do you customize colors? First determine how many unique values you have (e.g., using distinct()), then use scale_fill_manual(values = c([your colors here])). In the case of Medicaid expansion, there are only two options. In this example, let us use scale_fill_manual(values = c("cornflowerblue", "darkslategrey")), where we specify two colors:
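A quick way to see how many colors you need, run on the medicaid_expansion data frame loaded above (this assumes that chunk has been run):

```r
# One row per unique category -> supply that many colors to scale_fill_manual()
medicaid_expansion |>
  distinct(medicaid_expansion)
```

Here this returns two rows (Adopted and Not Adopted), so we supply two colors.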

ggplot(nationwide2023_with_data,
       aes(fill = medicaid_expansion)) +
  geom_sf() +
  theme_void() + 
  scale_fill_manual(values = c("cornflowerblue", "darkslategrey")) +
  labs(fill='Medicaid Expansion\nAs of Dec 2023') 

# ONE BRIEF NOTE: within labs(), I used \n.
# \n tells R you want the rest on the next line

Note: if the lack of data for Puerto Rico bothers you, feel free to add “Puerto Rico” inside filter(!NAME %in% c([list here])). You may have to do this often because national data sets frequently exclude Puerto Rico.

Practice 7 (10 mins)

Plot data from presidential_election1964.csv to show which states were won by Barry Goldwater vs. Lyndon B. Johnson in the 1964 election. Victor is the variable you’d want to plot here. Note: the map of states hasn’t really changed since 1959, when Hawaii became the 50th state. Also, states(year = 1964) will not work since that year does not exist in the API that tigris pulls the data from, so please use any available year for the state map in this practice (e.g., states(year = 2020)).

## WILL BE ASSIGNED FOR HOMEWORK

Package for ‘Dark Mode’ images: ggdark

You may use the ggdark package as demonstrated below. This works for all ggplot figures (not just maps).

Instead of using theme_minimal() or theme_void(), you need to use dark_theme_minimal() or dark_theme_void().

# run install.packages("ggdark") in console if not installed
library(ggdark)

ggplot(nationwide2023_with_data, aes(fill = medicaid_expansion)) +
  geom_sf() +
  dark_theme_minimal() + 
  labs(fill='Medicaid Expansion\nAs of Dec 2023') 
Inverted geom defaults of fill and color/colour.
To change them back, use invert_geom_defaults().

Rest of the class

Finish going over Lecture 5 slides from slide 44:

https://jmtfeliciano.github.io/DATA312Spring2025/Data312Spring2025_Lecture5#44